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Abstract 

Background: Transcription factors regulate numerous cellular processes by controlling the rate of production of 
each gene. The regulatory relations are modeled using transcriptional regulatory networks. Recent studies have 
shown that such networks have an underlying hierarchical organization. We consider the problem of discovering the 
underlying hierarchy in transcriptional regulatory networks. 

Results: We first transform this problem to a mixed integer programming problem. We then use existing tools to 
solve the resulting problem. For larger networks this strategy does not work due to rapid increase in running time and 
space usage. We use divide and conquer strategy for such networks. We use our method to analyze the 
transcriptional regulatory networks of £ coli, H. sapiens and 5. cerevisiae. 

Conclusions: Our experiments demonstrate that: (i) Our method gives statistically better results than three existing 
state of the art methods; (ii) Our method is robust against errors in the data and (iii) Our method's performance is not 
affected by the different topologies in the data. 



Background 

Genes are the smallest functional units of an organism. 
They carry out vital functions in cells by interacting with 
each other and with other molecules. Biological networks 
model such interactions among genes. Using biological 
networks, researchers are able to take a holistic approach 
on the analysis of cellular functions. Such analysis has 
shown that biological networks have a number of global 
properties. One of these properties is their hierarchi- 
cal organization. Hierarchical organization defines a par- 
tial ordering of the underlying genes. Recent studies 
have shown that directed interactions between transcrip- 
tion factors (TFs) in transcriptional regulatory networks 
(TRNs) impose a hierarchy on TRNs [1-5]. Analysis of the 
hierarchies of TRNs helps researchers better understand 
the flow of controlling signals through the transcription 
machinery [1,3]. TFs are special types of proteins that 
control the expression of other genes by binding to spe- 
cific regions of the DNA [6]. Since each protein is coded 
by genes, we will use the terms transcription factor and 
gene to refer to TFs throughout this paper. One way to 
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model hierarchy in TRNs is to assign levels to the inter- 
acting TFs. Figure 1 shows a sample network with its level 
assignments. In this paper, we consider the problem of 
finding the hierarchical organization of a TRN. The formal 
definition of our problem in this paper is as follows: 

Problem definition 

Let us denote a TRN with Q = (N,E). Here, N denotes 
the set of TFs and E denotes the set of directed interac- 
tions (i.e. edges) between the TFs in N. We refer to each 
TF in N as a node. We name the /th node in N as We 
represent an edge from the node n{ to n.j with (nu nj). Also, 
we denote the maximum possible number of levels in Q 
with M. We denote the hierarchy level assigned to a node 
Hi with ti where t{ is an integer in {1, 2, , 3, . . . ,M}. Let 
<p(nu nj) —> {0, 1} be a binary function that describes the 
key topological relationship between ni and nj. (We elabo- 
rate on the <p function below.) We compute a penalty score 
Pij for each pair of nodes as follows: 

[ <§>{n b nj), if tt < tj 
0, else 



Pij = 



Our aim is to find an assignment of hierarchies to the 
nodes of N which minimizes Pij- 
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Figure 1 Hierarchical decomposition of a sample network with 
seven nodes denoted by n-i , nj, • • • , to three levels. Directed 
edges represent the interactions. Dashed line splits the nodes into 
different levels. Each of the seven nodes are assigned one of the three 
existing levels. 



In this paper, we use two different 0 functions describ- 
ing two key topological properties. 

1. Adjacency. We define nj) = 1 if (nu nj) e E 
and nj) = 0 otherwise. 

2. Reachability. We define §(nu nj) = 1 if there exists 
a path from ni to rij in Q by traversing the edges in E. 
We set <p(nu nj) = 0 otherwise. 

Depending on the choice of the two 0 functions, we 
name the resulting distance function adjacency distance 
or reachability distance respectively. In summary, using 
adjacency distance we aim to assign levels such that every 
TF is above the others it directly regulates, and below its 
every direct regulator. On the other hand, using reachabil- 
ity distance, we consider any direct or indirect regulation 
relation between two TFs when assigning levels. 

There has been attempts to devise methods to reveal the 
underlying hierarchies of TRNs. Yu and Gerstein devel- 
oped BFS-level method to carry out this task [1]. This 
method uses breadth first search to assign hierarchies 
to TFs in a network. Although their method works for 
most networks, it fails to assign accurate levels for net- 
works that contain cycles. Jothi et al. developed vertex 
sort method [2]. This method incorporates topological 
sort algorithm for addressing the network hierarchy prob- 
lem. Vertex sort method does not have any restrictions 
on network motifs or cycles. However, rather than a cer- 
tain hierarchy, it assigns a range of possible levels for 
the TFs. Hartsperger et al. devised an algorithm based 
on breadth first search method to solve the problem 
[3]. Their solution improves the BFS-level method, and 
outputs a hierarchy for every network regardless of its 
topological features. However all these algorithms fail to 
minimize the number of edges that violate the hierar- 
chy. We name such edges as conflicting edges. We will 



elaborate on the quality of results calculated by existing 
methods in Section Comparison with existing hierarchical 
decomposition methods. 

Contributions 

In this paper, we develop a novel approach to tackle the 
problem of discovering underlying network hierarchy. We 
first consider the topology of the network as a set of 
constraints. Then, we define two different objective func- 
tions using adjacency and reachability penalty functions. 
We define the minimization of total penalty as the objec- 
tive of the problem. Using the above explanations, we 
transform this problem to a mixed integer programming 
problem(MIPP) [7]. We solve the resulting problem using 
existing MIPP solvers. We name our method Hierarchical 
DEcomposition of regulatory Networks (HIDEN). The 
main advantage of HIDEN is it introduces a sound math- 
ematical formulation to the network hierarchy problem. 
Our formulation can work with any objective function 
that is a linear combination of the edges. One drawback 
of HIDEN is that it does not scale well to very large net- 
works due to the growing size of the MIPP with increasing 
number of TFs. In order to address this issue we develop a 
divide and conquer approach. 

The rest of this article is organized as follows: In 
Section Algorithm, we describe the methods we devel- 
oped in this paper. In Section Results and discussion, 
we discuss the results of HIDEN in detail. Finally, in 
Section Conclusion, we briefly conclude the paper. 

Method 

In this section, we describe the hierarchical decompo- 
sition method we developed. Section HIDEN describes 
our method. Section Example demonstrates HIDEN on 
a simple example. Section Divide and Conquer method 
describes divide and conquer method we employ to scale 
HIDEN to larger networks. 

HIDEN 

HIDEN transforms the hierarchical network decomposi- 
tion problem to a MIPP [7]. Given a TRN, HIDEN first 
constructs a set of linear constraints and a linear optimiza- 
tion function that collectively describe the penalty of the 
decomposition. Then it uses existing optimizer software 
to solve the resulting problem. Next, we will explain how 
we formulate the MIPP. 

Let us denote the given network that will be decom- 
posed with Q. Let us denote the nodes (i.e., TFs) of this 
network with n\, n^, . . n m , where m is the total num- 
ber of nodes of Q. HIDEN, allows the user to set a limit 
on the maximum number of allowed levels for hierarchi- 
cal decomposition. Let us denote this number with M. 
Also, let us represent the level assigned to node ni with 
ti for all i e {1,2,..., m], (i.e. Vf, U e {1,2,3, ... ,M}). 
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We aim to find the level assignment T = {t\, t^, . . . ,t m } 
that minimizes the total penalty resulting from this level 
assignment. Therefore, the objective of our problem is the 
sum of individual penalty scores for each pair of nodes: 



minimize 



E 

l<i,j<m 



Pir 



(1) 



Pij 



Pij 



Next, we set a limit on the number of levels in the hier- 
archy. We do this by limiting the variables t{ as follows: 

0<ti< M. (2) 

We, then, represent each py as a linear constraint. 
Remember that py is a binary function in the following 
form: 

(p(rii,nj), if ti < tj 

0, else 

We can rewrite this function as follows: 

1, if ti < tj and (j){nu nj) = 1 

0, else 

Let us only consider the cases where nj) = 1. We 
can represent the rest of this function using two linear 
inequalities. The following set of constraints represent the 
function py. 

Pij£ {0,1} (3) 



tj -ti-M x p^ > -M 
tj - ti-M x p^ < -1 



(4) 



1 (5) 
In order to prove that these inequalities model the func- 
tion p^ correctly, we need to inspect all possible scenarios: 

1. if ti > tj and py = 0, then -1 > tj - t t > —(M - 1) 
and M x p^ = 0. Therefore both (4) and (5) holds. 

2. if ti < tj and py = 0, then tj — ti > 0 and M x py = 0. 
Therefore, (4) holds, however, (5) does not hold. 

3. if ti > tj and py = 1, then -1 > tj - t t > —(M - 1) 
and M x py = M. Therefore the expression 

tj — ti — M x p^ is smaller than or equal to — M — 1. 
This implies that (4) does not hold but (5) holds. 

4. if ti < tj and py = 1, then (M - 1) > tj - £/ > 0 and 
M x py = M, therefore both (4) and (5) holds. 

Therefore, enforcing the constraints (3), (4) and (5) 
implies: 

(Pij = 0oti> tj) A (p^ = l^ti< tj) (6) 

This corresponds to the latter definition of the func- 
tion p^ except the condition of ^(n^rij) = 1. Since we 
choose the function <p prior to the construction of the 
MIPP, we know the value of <p (nu nj) for every pair (nu nj). 
Therefore, we can manually ensure this property, by only 



considering p^ where ^(n^nj) = 1 and excluding py 
completely from our calculations where <p (nu nj) = 0. 

Based on the constraints above, the MIPP we construct 
to solve the network hierarchy problem is as follows: 

minimize p^ 

i,j s.t. (p{ni,nj)=\ 

where 

Vm 

0 < ti < M 
ti is an integer 
V«;, rij such that nj) = 1 
Pij e {0,1} 

tj -ti-M x p^ > -M 
tj -ti-M x p^ < -1 

Example 

In this section, we show the application of HIDEN on 
the network in Figure 1. We will use adjacency penalty 
function in this example. Therefore: 



0 



1 if an edge from rii to nj exists 
0 otherwise 



Using this <p function, the objective of the MIPP is to 
minimize the following function: 

J] Pij = P12+P13+P15+P24+P35+P36+P67+P73 

i,j s.t. (p(n if nj)=l 

Now we go over to the constraints. First set of con- 
straints limit tf. 

Vtti, i from 1 to 7 

0 < ti < M 
ti is an integer 
Then, we write the remaining functions as follows: 

V(ni,nj) e {(ni,n 2 ), (ni,n 3 ), (ni,n 5 ), («2>«4)> 
(« 3 , « 5 ), (« 3 , n 6 ), (n 6 , n 7 ), (n 7 , n 3 )} 

Pij € {0,1} 

tj -U-Mx p^ > -M 
tj -ti-Mx p^ < -1 

In the resulting problem, M is left as a user defined 
parameter. When we run the above problem with M = 4, 
HIDEN returns the following result: 
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P12 + P13 + ^15 + P24 + /*35 + /*36 + P67 + P73 = 1 

(h, t<i, £4, £5, t& t-j) = (4, 3, 3, 2, 2, 2, 1) 

Figure 2 shows the result of HIDEN on the given net- 
work. Note that HIDEN computes the level decomposi- 
tion successfully despite the existence of a cycle in the 
network. 

Divide and Conquer method 

HIDEN works well for networks that have up to 100 
nodes. For larger networks, however, it becomes difficult 
to solve the resulting MIPP using current hardware. This 
is mainly because the number of integer variables of the 
MIPP that describe the problem for the given network 
increases. This increases the memory consumption and 
the running time significantly. 

In order to solve our problem for networks that have 
more than 100 nodes we adopt a divide and conquer 
approach. Given a large TRN, we randomly divide this 
network into fixed size partitions. We do this by first ran- 
domly selecting a node from the given network. This node 
is the seed of the first partition, and thus it is a member of 
that partition. We then chose the remaining nodes in that 
partition iteratively by randomly growing the partition 
one node at a time. More specifically, at each iteration, we 
randomly select a node that is not selected so far and that 
is interacting with at least one of the nodes in the parti- 
tion. We repeat these iterations until the number of nodes 
in the partition reaches to a predefined threshold or all the 
nodes in the TRN are assigned to a partition. Then, we 
use HIDEN to decompose the subnetwork denned by the 
nodes and the edges in this partition into hierarchical lev- 
els. Once we determine the levels of all the nodes in the 
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Figure 2 Result of the hierarchical decomposition of the 
network in Figure 1 using HIDEN. Note that the decomposition 
differs from the decomposition in Figure 1 . 



current partition, we store those values as they will remain 
unchanged in the rest of our solution. Next we randomly 
pick another node from the given TRN among those that 
have not been considered yet as the seed of the next parti- 
tion. We grow the next partition similarly and use HIDEN 
to decompose it into hierarchical levels. We repeat these 
steps until we exhaust all the nodes in the given TRN. 

This method greatly reduces the running time of 
HIDEN on large networks. Since MIPP is NP-hard, 
depending on the size and the connectivity of the given 
TRN, the divide and conquer strategy can be orders of 
magnitude faster than the unpartitioned HIDEN. How- 
ever, due to random selection of the nodes, it is possible 
for us to not achieve the optimal result. This is possi- 
ble if the partition of the network we start with does not 
intersect with one or more of the levels in its underlying 
hierarchy. It is worth mentioning that this probability is 
usually very low. We can explain this as follows. Consider 
an N node network which contains n nodes belonging 
to a specific level x. If we select k nodes among these N 
nodes randomly, the probability that none of the k nodes 
belong to level x is (N-n choose k)/(N choose k). As k 
or n increases, this expression quickly converges to zero. 
In order to reduce this probability further, we repeat the 
divide and conquer strategy multiple times, each time 
starting from a randomly selected node. In our implemen- 
tation, we repeat this process 1000 times for real TRNs. 
After 1000 iterations, the probability of all the trials start- 
ing with an undesired (i.e. does not intersect with all 
the final levels) partition becomes very small (i.e. if for 1 
iteration, the probability is as high as 0.9, after 1000 iter- 
ations, the probability becomes 0.9 1000 ~ 10" 46 ). Since 
the running time of partitioned HIDEN is orders of mag- 
nitude less than that of the unpartitioned HIDEN, 1000 
repetitions remains to be practical. It took less than 10 
minutes for the largest dataset (S. cerevisiae). Our experi- 
ments showed that on the average, the results of the divide 
and conquer method reach its optimum in less than 500 
iterations. 

Results and discussion 

In this section, we evaluate HIDEN using a number of 
computational tests. In our tests, we let the underlying 
MIPP solver to handle the case of multiple optimal results. 
We only consider the unique result reported by the solver 
in our discussions. In the rest of this paper, we will use 
the term experiment to refer to in silico experiments for 
simplicity. 

DATASET In our experiments, we used TRNs of E. 
coli, H. sapiens and S. cerevisiae. We used the previously 
constructed networks, used by existing methods to test 
our method [1-3,5]. Earlier studies used existing interac- 
tion data to construct these three networks [8-17]. For 
the experiments that require gene function information, 
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we used the information included in the Gene Ontology 
Database [18]. We downloaded the list of essential genes 
for S. cerevisiae from the database of essential genes [19]. 

In the rest of this section, we first compare HIDEN 
with other existing hierarchical decomposition methods 
in Section Comparison with existing hierarchical decom- 
position methods. In Section Biological evaluation of 
network hierarchies we evaluate the results our method 
using a number of biological properties of TFs. Finally in 
Section Effects of input on HIDEN, we analyze the behav- 
ior of our algorithm with respect to different quantitative 
properties of the data. 

Comparison with existing hierarchical decomposition 
methods 

The objective of hierarchical decomposition is to arrange 
the TFs of a given network to levels so that the gene that 
alter the activity of the other appears at a higher level than 
the other throughout the network as frequently as possi- 
ble. The two <p functions described at the beginning of this 
paper model this relationship in terms of the adjacency 
and the reachability of the nodes in the given network. 
In this experiment we evaluate how well our method, 
HIDEN, compares against three state of the art meth- 
ods, namely vertex sort [2], HiNO [3] and BFS-level [1], in 
achieving this objective. To perform this comparison, we 
compute the penalty values obtained by HIDEN when it 
is applied on S. cerevisiae, E. coli and H. sapiens networks. 
We compute the same penalty values for the vertex sort, 
HiNO and BFS-level methods on the same three datasets 
for which their hierarchical decompositions are available. 

The penalty is a quantitative value that can be used to 
compare different methods on the same dataset. How- 
ever, since the size (number of genes and interactions) and 
the topology of these networks deviate significantly, the 
resulting penalties will differ significantly across datasets. 
In order to report a statistically sound value that describes 
the success of a method independent of the network 
size and topology, we also compute the Z-scores of the 
resulting penalty values. 

Let us denote the level assignment obtained by a specific 
method for an m node network with T = {t\, t2, • • • ,t m }. 
Let y denote the penalty of T according to a specific 0 
function. In order to compute the Z-score for T, we ran- 
domly produce many level assignments using the same 
level distribution as that of T. For each such assignment, 
we compute the resulting penalty value using the same 0 
function. Let ijl and a denote the mean and standard devi- 
ation of the resulting penalty values of all these random 
level assignments respectively. We calculate the Z-score as 
follows, 

z = (7) 

<7 



A higher Z-score implies a better level assignment. Typ- 
ically, a Z-score of four or higher is very significant as 
they indicate a result which is 4 or more standard devi- 
ations more extreme than the mean Table 1 summarizes 
the penalties and the corresponding Z-score values. For 
HIDEN, we reported the results for each of the six val- 
ues of maximum number of levels (M = {3, 4, • • • , 8}). For 
other methods the number of levels is not a configurable 
parameter. Hence, we reported the level that we obtained 
after execution of that method. We discuss the results for 
each organism next. 

S. cerevisiae 

We compared HIDEN with all the three competing meth- 
ods for this dataset. Our method outperformed all the 
three methods in terms of both adjacency and reacha- 
bility penalty values as well as the Z-scores regardless of 
the number of levels. As the number of levels allowed 
increases, the penalty incurred by HIDEN monotonically 
decreases. This, however, is not true for the Z-score as 
it depends on the distribution of nodes to levels. For 
instance HIDEN attains the highest Z-score for adjacency 
penalty at level eight whereas it attains that using only 
six levels for the reachability penalty. The biggest drop 
of penalty takes place when the number of allowed lev- 
els increases from three to four. We observe further, yet, 
smaller improvement in the penalty as the number of 
allowed levels increases beyond four. 

Among the competing methods, the vertex sort method 
of Jothi et al incurs the lowest penalty. It, however, uses 
significantly more levels than the HiNO and BFS-level 
methods. Furthermore, although it uses more levels than 
HIDEN as well, it performs worse than HIDEN in terms of 
both penalty and Z-score measures. Among the remaining 
two methods, HiNO and BFS-level, there is no clear win- 
ner. BFS-level leads to slightly less penalty at the expense 
of an additional level. As a result, HiNO produces slightly 
better Z-scores. 

E. coli 

For this dataset, we compared HIDEN with all three exist- 
ing methods. The penalty values of all the methods for 
Exoli are smaller compared to those of S. cerevisiae. This 
is mainly because E. coli network is much smaller. HIDEN 
performs the best among all methods for four or more lev- 
els according to both penalty and Z-score values. We did 
not observe any improvement for HIDEN beyond seven 
levels. Vertex sort attains statistically better results than 
HiNO and BFS-level methods. 

H. sapiens 

We compared HIDEN with vertex sort and BFS-level 
methods for this dataset. We omitted HiNO in this exper- 
iment because we could not run it on this dataset. The 
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Table 1 Comparison of HIDEN with three other methods on different networks 



Organism Method 


Num. 


Adjacency 




Reachability 






Level 


Penalty 


Z-score 


Penalty Z-score 




3 


140 


10.8 


3600 


13.9 




4 


103 


10.8 


3027 


14.6 


HIDEN 


5 


88 


9.5 


2774 


14.1 


6 


91 


10.2 


2573 


13.4 


Yeast 


7 


79 


9.8 


2469 


12.8 




8 


79 


11.6 


2365 


13.7 


vertex sort 


9 


179 


7.3 


3920 


10.2 


BFS-level 


4 


245 


6.0 


5734 


9.9 


HiNO 


3 


279 


6.8 


6205 


10.5 




3 


15 


6.3 


19 


7.1 




4 


8 


6.5 


10 


6.8 


HIDEN 

E. coli 


5 


5 


6.4 


7 


6.7 


6 


5 


6.2 


6 


6.6 


7 


5 


6.2 


5 


6.6 




8 


5 


6.2 


5 


6.6 


vertex sort 


6 


10 


5.7 


11 


6.5 


BFS-level 


4 


44 


3.7 


65 


5.1 


HiNO 


4 


41 


4.2 


59 


5.3 




3 


101 


7.4 


1950 


9.4 




4 


84 


7.4 


1608 


10.6 


HIDEN 

Human 


5 


75 


7.9 


1435 


10.8 


6 


66 


7.9 


1347 


10.2 




7 


72 


7.5 


1287 


9.7 




8 


72 


7.3 


1248 


9.9 


vertex sort 


5 


207 


1.2 


2162 


5.8 


BFS-level 


3 


210 


0.72 


2163 


6.0 


a For HIDEN we vary the maximum allowable level from three to eight. We report the adjacency and reachability penalties as well as the Z-scores for these penalties for 
each experiment. "Num. Level" denotes the maximum number of allowed levels. The results for HiNO on human are omitted, because of problems in execution. 



results follow a similar pattern as those of the two other 
datasets. HIDEN outperformed vertex sort and BFS-level 
even when it used fewer levels. The gap between the 
Z-scores of HIDEN and the other methods was even 
more significant than the previous datasets. HIDEN led 
to the highest drop of penalty of from three to four lev- 
els and continued to improve with increased number of 
levels. 

We conclude that, HIDEN performs significantly bet- 
ter than the competing methods for all the three major 
datasets. 

Biological evaluation of network hierarchies 

In this section, we analyze HIDEN using biological evi- 
dence. First, we check functional properties of genes 
across different levels. Then, we evaluate the locations of 
essential genes in the hierarchy. 



Functions of genes 

TRNs regulate the expression of genes that take part in 
many processes in an organism [13]. Earlier works suggest 
that the concentration of genes participating in certain 
functions are closely related to the level in the hierar- 
chy [1]. However, majority of cellular functions in the cell 
take place through multiple reactions happening in suc- 
cession. Therefore, we expect a uniform distribution of 
functions among different levels. In order to confirm this 
theory, we calculated the functional enrichment score of 
every single level in the hierarchies we discovered. We first 
decomposed each of the H. sapiens, E. coli and S. cere- 
visiae TRNs to each of the three to eight levels. Then, 
for the resulting 18 combinations (i.e., 3 organisms and 
6 levels), we calculated a p-value for each gene ontology 
term and level pair. We obtain the gene ontology terms 
from the Gene Ontology database [18]. We calculate these 
p-values assuming a hypergeometric distribution of the 
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Figure 3 The p-values for the observed number of genes 
annotated with the wound healing process at each level for the 
H. sapiens TRN. The network is divided into six levels using HIDEN 
with reachability as penalty scheme. 



gene ontology term over all the TFs in the network. We 
observed that the p-values were similar at all levels of 
the hierarchy (see Figure 3). This supports our initial the- 
ory that majority of the functions the TFs in our network 
participate are not enriched at any level. One example 
among many is the wound healing function in human net- 
work [20]. However, in rare instances, we observed some 
functions being moderately enriched in some levels. For 
example, third of the eighth level (the third level when we 
decompose the network into eight levels) of human TRN 
is enriched with the signal transduction function. How- 
ever, we could not detect any other levels in any other 
network enriched with this function. 

Each gene in an organism takes part in at least one 
metabolic function. A gene participating in a large num- 
ber of reactions is a common phenomena in many organ- 
isms. In this experiment, we compare the level of each 
gene with the number of functions they participate in. By 
doing so, we aim to discover any existing relation between 
the two. In order to do this, we use the gene ontology 
database [18]. The participation of a gene in a reaction 
is represented using gene ontology annotations in the lit- 
erature. For each gene in our networks, we first count 
the number of gene ontology terms it is annotated with. 
We also decomposed each network into six layers using 
HIDEN. Then, we visualized the networks using hierar- 
chy information as location and functional information as 
color of each node. Figures 4, 5 and 6 show our results. 
Our results suggest that there is no strong correlation 
between the number of functions of each gene and the 
level of the gene in the hierarchy. However, in all three 
organisms, we observed that the genes with the highest 
number of annotations tend to lie at the middle levels (i.e. 
2,3 or 4). This result indicates that regulatory hubs in the 



Figure 4 Illustration of the distribution of the number of 
functions that each gene participates in for the TRN of H. 
sapiens. Each circle represents a TF. The network is divided into six 
levels using the reachability as the penalty function and placed in 
relevant levels. The horizontal lines separate the TFs to different levels. 
The genes are colored according to the number of Gene ontology 
terms they are annotated with in gray scale. The least number of 
functions is assigned the color black, where the largest number of 
functions is assigned the color white. 



TRNs are not at the top levels. They are rather at the 
middle levels of the hierarchy. 

Gene Essentiality 

The genes which an organism cannot survive without are 
called essential genes [19]. Such genes take part in vital 
functions in the cell. Earlier works proposed that the num- 
ber of essential genes is strongly correlated to its location 
in the hierarchy [2] . In this experiment, we aim to find out 
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Figure 5 Illustration of the distribution of the number of 
functions that each gene participates in for the TRN of S. 

cerevisiae. Each circle represents a TF. The network is divided into six 
levels using the reachability as the penalty function and placed in 
relevant levels. The horizontal lines separate the TFs to different levels. 
The genes are colored according to the number of Gene ontology 
terms they are annotated with in gray scale. The least number of 
functions is assigned the color black, where the largest number of 
functions is assigned the color white. 
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Figure 6 Illustration of the distribution of the number of 
functions that each gene participates in for the TRN of E. coli. 

Each circle represents a TF. The network is divided into six levels using 
the reachability as the penalty function and placed in relevant levels. 
The horizontal lines separate the TFs to different levels. The genes are 
colored according to the number of Gene ontology terms they are 
annotated with in gray scale. The least number of functions is 
assigned the color black, where the largest number of functions is 
assigned the color white. 



if there exists any such relation. In order to do this, we 
count the number of essential genes at each level of hierar- 
chy in a five level decomposition of S. cerevisiae TRN. We 
then report the ratio of number of essential genes to total 
number of genes in a level in the hierarchy. We also cal- 
culate P-values for the number of genes observed in each 
level to show how significant the observations are. Figure 7 
shows our results for this experiment. We observe that 
there is a higher density of essential genes at the middle 
levels of the hierarchy. We also observe that, the P-values 
we calculated show that the level three (of maximum four) 



is highly enriched in essential genes. This result, com- 
bined with the results of the previous experiment support 
our theory that regulatory hubs of an organism are often 
at the middle levels of the hierarchy, rather than the top 
level. Indeed strong correlation between hubs and essen- 
tiality has been observed in the literature that supports 
our results [21]. 

Figure 8 shows a subnetwork of the human TRN. The 
highlighted TFs are shown to have abnormalities in many 
types of cancers. c-Myc is a TF which has a key role 
in cell proliferation [22]. Overexpression of c-Myc may 
result in development of different types of cancers. TP53 
is an essential gene which is regulated by c-Myc. The 
expression of this gene prevents formation of tumors by 
activating DNA repair, inhibiting cell growth and finally 
inducing apoptosis [23]. TP53 executes apoptosis by acti- 
vating caspases (i.e. CASP8, CASP3) [24]. FLU is another 
protein regulated indirectly by c-Myc, The fusion of pro- 
teins EWSR1IFLI1 and EWSR1/ERG due to a mutation 
creates a master regulator for the development in Ewing's 
Sarcoma [25,26]. EWSR1/FLI1 causes tumor formation by 
upregulating genes that are involved in cell proliferation 
(i.e. IGF1) and downregulating genes that control apop- 
tosis and growth inhibition (i.e. IGFBP3, TGFBR2) [27]. 
These small scale observations support our previous justi- 
fications that regulatory hubs and essential genes are more 
likely to be situated in the middle layers of the TRNs. 

Effects of input on HIDEN 

In this section, we analyze HIDEN by changing the input 
of the algorithm. In order to do this, we first change the 
number of layers we decompose the network into. Then, 
we assume errors and uncertainties in input networks. 
Using our results, we explain how reliable our method is 
under different conditions. Finally, we discuss the quality 
of our results for different subnetworks. 

Navigation of genes across levels in varying hierarchies 

The location of a gene in the hierarchy depends highly 
on the total number of levels. This leads to the follow- 
ing important question: How much can we rely on the 
relative levels of genes? One key feature of our method 
is that it allows the user to specify the number of levels 
in the hierarchical decomposition of the given network. 
By exploiting this feature, next, we answer this question. 
Particularly, we show how the change the number of lev- 
els affect the locations of the nodes in the hierarchy. In 
order to do this, we first calculate the levels of every node 
for S. cerevisiae, E. coli and H. sapiens networks in a six 
level hierarchy. We use these calculations to place every 
node in their respective positions. We then decompose 
the same networks to five levels. We use the result of the 
second decomposition to assign colors to each node in the 
network. Figure 9 shows the results of this experiment for 



£ 0.28- 
c 

CD 

O 0.26- 
o 

0 0.24- 

-Q 

E 

-| 0.22- 

£ 0.2- 
c 

CD 

& 0.18- 

1 0.16- 

CO 
CO 

w 0.14- 
o 

_g 0.12- 
E 

i 0.1- 



2 

Level 



-0.9 

-0.8 

-0.7 

-0.6 

-0.5 o 
"cc 

-0.4 I 

-0.3 

-0.2 

-0.1 

-0 



Figure 7 The ratio of essential genes(solid boxes) and the P- 
values(dashed line) for the number of essential genes observed 
in S. cerevisiae TRN in each level of the hierarchy. The network is 
divided into five different levels using the reachability penalty. The 
P-values are calculated based on the hypergeometric distribution. 
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Figure 8 The TRN of H. sapiens with a subnetwork related to cancer highlighted. In this subnetwork, external signals (i.e. Growth factors, other 
proteins and molecules) regulate or affect the proteins c-Myc, FLU and ERG. Many other regulatory connections and transcription factors are omitted 
for simplicity. 



S. cerevisiae. Our results demonstrate that for the majority 
of the genes, the relative position between different genes 
is preserved. In different decompositions, discovering 
genes in the same relative positions with respect to other 
genes suggest the accuracy of our method for the relevant 
genes. However, there exists some genes that violate this 
relationship. For example, in Figure 9, nodes YGL013C, 
YMR280C and YKL109Waie at least two levels away from 
where they were earlier. Therefore, we conclude that the 
predicted levels of these genes not as reliable as the others. 
This approach can be used for evaluating the reliability of 
our results. Figures 10 and 11 present similar results for E. 
coli and H. sapiens. 

Robustness of HI DEN 

One weakness of all hierarchical decomposition methods 
arises from the nature of the biological network datasets 
that they are incomplete and imprecise. As a result, the 
actual network topology observed can be slightly dif- 
ferent than what is given in existing network databases 



[28]. Furthermore, studies demonstrate that the network 
topologies can vary due to many other factors such as 
external perturbations [29] and varying genetic profiles 
and disorders [30] even within the same species. This 
raises the question that how much can we rely on a hierar- 
chical decomposition if the topology of the given network 
is not accurate? 

This section evaluates the robustness of HIDEN with 
respect to inaccuracies in the given network. In order to 
do this, we carry out the following steps. Given a network, 
we first find its hierarchical decomposition, denoted by T. 
We then create many mutant networks from this network 
for a given mutation percentage using the degree preserv- 
ing edge shuffling model [31]. This is the state of the art 
network alteration method that preserves the degree dis- 
tribution of the network. We elaborate on this method 
later in this section. Thus, each mutant network denotes a 
potential precise network that is different than the origi- 
nal network. Using the topology of each mutated network, 
we compute the penalty of the hierarchical decomposition 
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Figure 9 Illustration of the navigation of genes across levels fortheTRN of S. cerevisiae. Each circle represents a gene. The locations represent 
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T we found at the first step. Thus, this penalty shows how 
bad our results are if our network is inaccurate. We repeat 
this for many mutant networks and report the average of 
their penalties. 

Briefly, we mutate a given network Q as follows. Let 
(u, v) and (5, t) denote two randomly selected edges from 
this network such that (i) u and v are different; 5 and t are 
different, and (ii) the edges (u, t) and (5, v) do not exist in 
Q. We remove edges (u, v) and (5, t) and add edges (u, t) 
and (5, v). Let r\ denote the number of times we repeat this 
edge shuffling process. Then the mutation percentage of 
the original network is ^ x 100% rounded to the nearest 
integer. 

We conducted the experiments on S. cerevisiae, E. coli 
and H. sapiens and on both penalty metrics adjacency 
and reachability for different number of levels of hierar- 
chy. Figure 12 summarizes the results for S. cerevisiae, E. 
coli and H. sapiens using the adjacency and reachability 
penalties when three, six or eight levels are allowed. 



The most important observation that follows from our 
results is that the Z-score remains high even after we 
mutate the network by 20%. We observe a slight drop 
as the mutation rate increases, yet the results remain 
statistically significant. This observation holds for small 
(3), medium (6) and large (8) number of allowed hier- 
archical levels. This result has two major implications. 
First, HIDEN is extremely robust with respect to net- 
work mutations. It was able to identify hierarchical struc- 
ture using the clues that remain in the topology of 
the given network after all mutations take place. Thus, 
even if the original network may be imprecise, the 
decomposition found by HIDEN will be a true decom- 
position with a high probability. Second, the degree 
preserving edge shuffling does not affect the decompos- 
ability of the network. The fact that even the original 
level assignment T yielded statistically significant penal- 
ties on the mutant network proves that it is possible to 
find another decomposition T of the mutant network 
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Figure 10 Illustration of the navigation of genes across levels for the TRIM of £. coli. Each circle represents a gene. The locations represent the 
levels of the genes in a 6-level decomposition, whereas colors of the genes represent their locations in a 5-level decomposition. The color red 
represents the bottom level in the hierarchy, green represents the topmost level and the gradient of colors in between is used to color the nodes in 
between. 



that is statistically at least as significant as (possibly more 
significant than) T. 

Stability ofHIDEN to network mutations 

So far, we have observed that HIDEN was able to decom- 
pose the networks of the given three organisms success- 
fully. This observation along with our last conclusion 
from the previous section begs the following question: 
Can HIDEN decompose the mutant networks or was 
there a bias in topology of these three networks in favor 
of HIDEN? In other words, how stable is HIDEN with 
respect to alterations in the network topology? 

In order to evaluate the stability of HIDEN with 
respect to network alterations, we carry out the follow- 
ing steps. Given a network Q, we create many mutant 
networks Q' from Q for a given mutation percentage 
using the degree preserving edge shuffling. We then use 
HIDEN on each such Q' to find a new hierarchical level 
assignment T specifically for that topology. Thus, this 
penalty shows how much the performance of HIDEN 



is affected from network alterations. We repeat this for 
many mutant networks and report the average of their 
results. 

Tables 2, 3 summarize the penalties and the correspond- 
ing Z-scores for varying mutation percentages as well as 
varying maximum number of allowed hierarchical lev- 
els with according to adjacency and reachability penalties 
respectively. For all the three organisms, we observe sim- 
ilar patterns in our experiments. The most important 
observation is that HIDEN achieves very high Z-scores 
at all mutation rates. Furthermore, these Z-scores are 
comparable to those of the original network (i.e., muta- 
tion percentage = 0%). The adjacency penalty values are 
also comparable to those for the original network. This 
coincides with the observation we made in the robust- 
ness test in Section Robustness of HIDEN that the degree 
preserving edge shuffling does not alter the decompos- 
ability of the given network. As the mutation percentage 
increases, Z-score and the adjacency penalties do not 
show a clear trend to increase or decrease. We, thus, reach 
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Figure 1 1 Illustration of the navigation of genes across levels for the TRN of H. sapiens. Each circle represents a gene. The locations represent 
the levels of the genes in a 6-level decomposition, whereas colors of the genes represent their locations in a 5-level decomposition. The color red 
represents the bottom level in the hierarchy, green represents the topmost level and the gradient of colors in between is used to color the nodes in 
between. 



to the conclusion that HIDEN is stable with respect to 
network alterations. 

/.oca/ versus global hierarchy of subnetworks 

The entire biological network of an organism can be con- 
sidered as a (possibly overlapping) collection of smaller 
subnetworks where each subnetwork corresponds to a 
coherent functional group. For instance, cell cycle net- 
work describes the interactions that take place during the 
division and replication of a cell to produce new cells. 
Similarly, meiosis network describes a special type of cell 
division only observed in reproductive cells. These smaller 
subnetworks may follow a hierarchical structure as well 
within their local topologies. Clearly, we can use HIDEN 
on each of these subnetworks to find their hierarchical 
structure by isolating them from the rest of the network 
one by one. We call such hierarchical decomposition as 
local hierarchy since it only depends on the topology of the 
subnetwork. We call the hierarchical decomposition we 



obtain for a subnetwork from the entire networks topol- 
ogy as its global hierarchy. In this experiment, we evaluate 
how well the global hierarchy of a subnetwork fits to its 
local hierarchy. 

Let us denote the entire network with Q and a subnet- 
work of Q with Q f . Let us denote the level assignments for 
the networks Q and Q' by HIDEN with T and T' respec- 
tively. Let T C T be the global hierarchy of Q' induced 
from T. We calculate the adjacency penalty and Z-score of 
T and T' using the topology of Q' . Table 4 summarizes the 
results for S. cerevisiae for two major subnetworks, namely 
cell cycle and meiosis taken from the KEGG database 
[32] with different values of maximum number of allowed 
levels. 

The results demonstrate that the local hierarchy is bet- 
ter than the global one. This is not surprising as the 
global hierarchy is determined based on the entire net- 
work. Thus, the levels are determined with the goal of 
optimizing all the interactions in the network. On the 
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other hand, local hierarchy is determined only based on 
the restrictions asserted by the corresponding subnet- 
work. We observe that the gap between the local and the 
global hierarchy is small for the cell cycle network. It is, 
however, significant for the meiosis network. In order to 
understand the factors that contribute to this gap, we per- 
formed a detailed analysis of the topology of the entire S. 
cerevisiae network as well as these two subnetworks. Cell 
cycle contained 54 genes and 108 interactions. Meiosis 
was smaller, containing 44 genes and 62 interactions. We 
define an interaction from a gene that is not in the subnet- 
work to a gene that is in the subnetwork as an incoming 
edge. If the interaction points the opposite direction, we 
define it as an outgoing edge. We computed the number 
of incoming and outgoing edges to each subnetwork. The 
number of incoming edges per node was 1.9 and 3.6 for 



cell cycle and meiosis respectively. That for the outgoing 
edges was 20.6 and 18.8 respectively. This suggests that as 
the number of incoming edges increase, the chance that 
the global hierarchy deviates from the local one increases. 
This is indeed intuitive as the incoming edges influence 
the hierarchy much more than the outgoing edges. For the 
local hierarchy, we observe that a small number of levels 
is sufficient to get a good hierarchical decomposition. For 
instance, HIDEN resolved all conflicts for cell cycle in only 
five levels. It resolved all but one conflict for meiosis in 
four levels. 

These results demonstrate that the local and global hier- 
archies can deviate significantly depending on the topo- 
logical relationship between the subnetwork and the rest 
of the network. Thus, detailed analysis of both decom- 
positions can yield useful information regarding how the 
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Table 2 Stability Experiment for increasing mutation 
percentages: The numbers in parenthesis is the average 
adjacency penalty 



Organism 


Level 




Mutation [%] 






0 


5 


10 


20 


40 




3 


9.10 


11.63 


10.17 


12.12 


10.10 






(118) 


(137) 


(127) 


(127) 


(130) 


Yeast 


4 


9.31 
(99) 


11.20 
(117) 


11.30 
(103) 


11.37 
(114) 


10.91 
(104) 




5 


9.26 


10.96 


10.97 


1 1.62 


10.67 






(84) 


(108) 


(92) 


(103) 


(88) 




3 


5.76 


4.98 


5.26 


4.98 


5.34 






(17) 


(16) 


(22) 


(16) 


(15) 


E. coli 


4 


5.43 
(11) 


4.77 
(15) 


5.67 
(14) 


4.77 
(15) 


5.57 
(10) 




5 


5.46 


4.72 


5.52 


4.72 


5.34 






(9) 


(15) 


(12) 


(15) 


(8) 




3 


7.44 


9.24 


8.66 


7.79 


9.14 






(101) 


(105) 


(95) 


(106) 


(107) 


Human 


4 


7.37 
(84) 


9.09 
(90) 


8.14 
(83) 


8.22 
(92) 


8.98 
(92) 




5 


7.90 


8.83 


7.70 


8.53 


9.33 






(75) 


(93) 


(81) 


(73) 


(86) 


a The numbers above them is the corresponding Z-score. Level indicates the 



maximum number of allowed levels. 

functions of a given subnetwork is depends on the other 
genes. HIDEN is capable of revealing such information. 

Conclusion 

In this paper, we took a novel approach to the prob- 
lem of discovering underlying network hierarchy. We first 
transformed our problem to a MIPP. Then, we solved 
this problem using existing optimizers. We named this 
method Hierarchical DEcomposition of gene regulatory 
Networks. However, due to the growing size of the MIPP 
with increasing number of genes, we encountered scala- 
bility issues. We proposed a divide and conquer approach 
to tackle such problems. Later, we experimentally showed 
that our algorithm outperformed existing solutions in 
terms of minimizing conflicting edges in hierarchy. We 
also evaluated our method using biological and statisti- 
cal tools. Then, we discussed the relation between the 
hierarchy of a gene in a TRN and its location in cell, essen- 
tiality and function, based on our experimental results and 
biological evidence. 

Availability and requirements 

The source code for HIDEN can be found in Additional 
file 1. The code is written in C++. The code requires 
IBM ILOG CPLEX version 12 or higher to compile and 
run. Please refer to the documentation of CPLEX for 



Table 3 Stability Experiment for increasing mutation 
percentages: The numbers in parenthesis is the average 
reachability penalty 



Organism Level 



Yeast 



Mutation [%] 

0 5 10 20 40 

12.35 15.21 14.62 15.20 15.31 

(3674) (3600) (3483) (3599) 3598 



12.33 
(3027) 



14.74 
(3026) 



14.47 
(2923) 



14.73 
(3025) 



14.66 
(3024) 



12.27 
(2754) 
7.73 
(21) 



14.18 
(2773) 
6.58 
(26) 



14.29 
(2644) 
6.90 
(21) 



14.18 14.18 

(2772) (2771) 

6.58 6.37 

(26) (27) 



E. coli 



7.29 
(15) 



6.16 
(20) 



6.21 
(15) 



6.16 
(20) 



5.93 
(26) 



6.95 
(14) 



6.00 
(20) 



6.22 
(11) 



6.00 
(20) 



5.81 
(26) 



Human 



8.25 11.17 

(1950) (1951) 

8.88 12.02 

(1628) (1613) 



11.17 
(1951) 
12.02 
(1613) 



8.66 11.17 

(1944) (1951) 

10.45 12.02 

(1608) (16.13) 



12.40 
(1431) 



11.71 
(1441) 



11.71 
(1441) 



10.45 
(1431) 



11.71 
(1441) 



a The numbers above them is the corresponding Z-score. Level indicates the 
maximum number of allowed levels. 

platform specific instructions on how to compile and run 
applications that use CPLEX libraries. The results of our 
code using the penalty schemes described in this paper for 
TRNs of E. coli, H. sapiens and S. cerevisiae can be found 
in Additional file 2. 



Table 4 Comparison of the global hierarchy of 
subnetworks to their local hierarchy 


Subnetwork 


Num. 


Global 


Local 






Level 


Penalty Z-score Penalty 


Z-score 




3 


4 


3.2 3 


4.2 




4 


2 


3.3 1 


4.0 


Cell Cycle 


5 


2 


3.2 0 


3.7 




6 


2 


3.1 0 


3.7 




7 


2 


3.0 0 


3.7 




8 


2 


2.9 0 


3.7 




3 


8 


0.7 2 


3.5 




4 


6 


1.2 1 


3.8 


Meiosis 


5 


6 


1.2 1 


3.8 




6 


5 


1.6 1 


3.8 




7 


5 


1.5 1 


3.8 




8 


5 


1.5 1 


3.8 



a The experiment is conducted on the two subnetworks of S. cerevisiae, namely 
cell cycle and meiosis. "Num. Level" denotes the maximum number of 
allowed levels. 
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Additional files 



Additional file 1 : A C++ implementation of the algorithm developed 
in this paper. 

Additional file 2: The resulting level assignments for the 
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